In recent decades, Philadelphia has experienced substantial changes in population dynamics, economic fluctuations, and redevelopment pressures. Among these changes, housing values have become a key indicator of the socio-economic conditions within the city’s neighborhoods. Rising housing values typically reflect an influx of wealthier residents or a neighborhood undergoing gentrification, while declining housing values often signal economic pressure or decline. Accurately forecasting median housing values across Philadelphia is therefore crucial for urban planners and policymakers, helping them proactively address risks such as displacement and disinvestment while supporting the city’s long-term vision of fostering vibrant, diverse, and resilient communities.
To better understand the potential factors influencing housing values in Philadelphia, we identify several key variables: educational attainment, vacancy rates, the proportion of single-family homes, and poverty levels. These factors are closely related to housing market trends and provide insights into the neighborhood’s socio-economic status.
Educational attainment is closely linked to the socio-economic characteristics of neighborhoods. Individuals with higher educational attainment typically earn higher incomes and contribute more to local economies. A higher concentration of residents with advanced educational backgrounds tends to increase the demand for well-designed housing in affluent neighborhoods; with a consistent supply, housing prices and values are likely to rise. Conversely, a high poverty rate indicates a struggling community: since many residents cannot afford more expensive housing, housing values in those neighborhoods are likely to be lower.
High vacancy rates often correlate with declining neighborhoods and reduced median house values. Research shows that vacant properties affect multiple facets of community life, such as housing and neighborhood vitality, crime prevention efforts, and the well-being of commercial districts. As a result, areas with numerous vacant housing units typically see diminished median housing values.
The proportion of single-family homes in an area influences housing values. These homes are generally private and comfortable, making them more desirable in many U.S. housing markets. However, they are relatively common in suburban neighborhoods and may suffer from low accessibility and limited infrastructure, which can negatively affect their property values as well.
In this study, we utilize ordinary least squares (OLS) regression to analyze the relationship between these socioeconomic factors and median house values in Philadelphia. By examining these relationships, we aim to identify critical predictors of median housing values throughout Philadelphia and offer insights for decision-makers and community initiatives.
To predict median house values in Philadelphia, we obtained the original dataset from the United States Census. The dataset represents census block groups from the year 2000 and initially contained 1,816 observations. The key variables included:

- MEDHVAL: median house value
- PCTBACHMOR: percentage of individuals with bachelor’s degrees or higher
- NBELPOV100: number of households living below the poverty line
- PCTVACANT: percentage of vacant housing units
- PCTSINGLES: percentage of single-family housing units
- MEDHHINC: median household income
To refine the dataset for modeling purposes, we applied the following filtering criteria:
Additionally, we removed a specific block group in North Philadelphia that exhibited inconsistencies, with an unusually high median house value (over $800,000) despite a very low median household income (less than $8,000).
After data cleaning, the final dataset contained 1,720 observations.
We will first examine the summary statistics — the mean and standard deviation (SD) — of the key variables in the dataset: the dependent variable MEDHVAL (Median House Value) and the predictors NBELPOV100 (# of Households Living in Poverty), PCTBACHMOR (% of Individuals with Bachelor’s Degrees or Higher), PCTVACANT (% of Vacant Houses), and PCTSINGLES (% of Single House Units).
The mean (\(\bar{X}\)) represents the average value of a variable and is calculated as:
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
where:
- \(X_i\) represents each individual observation
- \(n\) is the total number of observations
The mean gives us a single representative value of the dataset, which helps in understanding the typical value for a given variable.
To measure variability, we use the standard deviation (SD), which quantifies how much the values in a dataset deviate from the mean. The formula for the sample standard deviation (\(s\)) is:
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \]
where:
- \(X_i\) represents each individual observation
- \(\bar{X}\) is the mean of the observations
- \(n\) is the total number of observations
A larger standard deviation indicates that the data points are more spread out, while a smaller standard deviation suggests that the data points are closer to the mean.
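The two formulas above can be checked directly in R. The sketch below computes the mean and the sample standard deviation by hand on a small illustrative vector (`x` is toy data, not the Census dataset) and compares them against the built-in `mean()` and `sd()` functions:

```r
# Toy vector of hypothetical house values
x <- c(42000, 55000, 61000, 70000, 88000)

n     <- length(x)
x_bar <- sum(x) / n                          # mean: (1/n) * sum of X_i
s     <- sqrt(sum((x - x_bar)^2) / (n - 1))  # sample SD with n - 1 denominator

all.equal(x_bar, mean(x))  # matches the built-in mean
all.equal(s, sd(x))        # matches the built-in sample SD
```

Note that `sd()` in R uses the \(n-1\) (sample) denominator, matching the formula above.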
We will also examine the histograms and apply log transformations for key variables to assess whether the transformed variables follow a more normal-like distribution.
Histograms provide a visual representation of how a variable’s values are distributed, helping to identify whether the data follows a normal distribution or is right- or left-skewed. OLS regression assumes that the model residuals are approximately normally distributed, and heavily skewed variables often produce skewed residuals.
If the histogram is right-skewed, it suggests that a small number of observations have significantly higher values than the rest. For variables following a right-skewed distribution, we apply a log transformation to improve normality. Since the log transformation is undefined for zero or negative values, we must first check whether any variable contains zeros.
- If no zeros are present, we apply the standard log transformation (we use the natural logarithm throughout, matching the `LN`-prefixed variables created in the analysis code):
\[
X' = \ln(X)
\]
- If zeros are present, we add 1 before taking the log:
\[
X' = \ln(X + 1)
\]
This ensures that all values remain positive and avoids undefined values.
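This rule can be sketched as a small R helper. The analysis code later uses the natural log via `log(1 + x)`, so the sketch does too; the input vectors here are illustrative values chosen so the outputs are easy to verify by eye:

```r
# Apply log(X) when a variable has no zeros, log(X + 1) otherwise
log_transform <- function(v) {
  if (any(v == 0, na.rm = TRUE)) log(v + 1) else log(v)
}

log_transform(c(exp(1), exp(2)))  # no zeros: plain log -> 1, 2
log_transform(c(0, exp(1) - 1))   # zeros present: log(v + 1) -> 0, 1
```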
By comparing the original histograms with the log-transformed histograms, we will assess whether the transformation improves the suitability of the data for predictive modeling.
Before proceeding with regression analysis, we will analyze correlations between predictors to detect potential multicollinearity, which can distort model interpretation.
Multicollinearity occurs when predictors are highly correlated, which can lead to unstable regression coefficients. It also inflates standard errors, reducing the statistical significance of predictors, and increases the risk of overfitting, as redundant variables do not contribute new information to the model.
The correlation coefficient \(r\) is calculated as:
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \] Equivalently, this can be written in terms of standardized values:
\[
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
\] where:
- \(x_i\) and \(y_i\) are individual observations of variables \(X\) and \(Y\), respectively
- \(\bar{x}\) and \(\bar{y}\) are the means of \(X\) and \(Y\)
- \(s_x\) and \(s_y\) are their sample standard deviations
- in the first form, the numerator is proportional to the covariance between \(X\) and \(Y\), and the denominator normalizes it to the range \([-1, 1]\)
The correlation coefficient \(r\) ranges from -1 to 1: a value of +1 indicates a perfect positive linear relationship, while a value of -1 indicates a perfect negative linear relationship. A value of 0 suggests no linear relationship between the variables, although a nonlinear relationship may still exist.
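The standardized-values form of \(r\) can be verified against R’s built-in `cor()`. The sketch below uses two toy vectors, not the actual predictors:

```r
# Illustrative data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson's r from the standardized-products form: (1/(n-1)) * sum(z_x * z_y)
n        <- length(x)
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

all.equal(r_manual, cor(x, y))  # agrees with the built-in Pearson correlation
```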
After getting a general sense of the data, we conduct a multiple regression analysis to examine the relationship between the dependent variable, Median House Value (MEDHVAL), and the predictors: educational attainment (PCTBACHMOR), number of households living in poverty (NBELPOV100), percentage of vacant houses (PCTVACANT), and percentage of single house units (PCTSINGLES). Regression analysis is a statistical method for examining the relationship between a dependent variable and one or more predictors. With this type of analysis, researchers can identify the strength and direction of the relationships between variables, make predictions, and assess the significance of predictors. The model estimates a coefficient for each predictor, which represents the expected change in the dependent variable for a one-unit change in that predictor, holding the other predictors constant. For this study, the multiple regression model is formulated as follows: \[ \text{LNMEDHVAL} = \beta_0 + \beta_1 \text{PCTVACANT} + \beta_2 \text{PCTSINGLES} + \beta_3 \text{PCTBACHMOR} + \beta_4 \log(\text{NBELPOV100}) + \epsilon \] where LNMEDHVAL is the log-transformed median house value, PCTVACANT is the proportion of vacant housing units, PCTSINGLES is the proportion of single-family housing units, PCTBACHMOR is the percentage of residents holding a bachelor’s degree or higher, and log(NBELPOV100) is the log-transformed number of households living below the poverty line.
\(\beta_0\) is the intercept, \(\beta_1\) through \(\beta_4\) are the coefficients for each predictor, and \(\epsilon\) is the error term. Each coefficient represents the change in the log-transformed median house value for a one-unit change in the corresponding predictor, holding the other predictors constant. The error term \(\epsilon\) accounts for variability in the dependent variable that is not explained by the predictors.
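As a sketch of how this specification is fit and read in R, the block below simulates stand-in data with the same variable names (all values here are synthetic, not the Census data) and fits the model with `lm()`:

```r
set.seed(1)
n <- 500
toy <- data.frame(
  PCTVACANT    = runif(n, 0, 40),
  PCTSINGLES   = runif(n, 0, 50),
  PCTBACHMOR   = runif(n, 0, 60),
  LNNBELPOV100 = log(1 + rpois(n, 150))
)
# Simulated outcome with known coefficients plus noise
toy$LNMEDHVAL <- 11 - 0.02 * toy$PCTVACANT + 0.003 * toy$PCTSINGLES +
  0.02 * toy$PCTBACHMOR - 0.08 * toy$LNNBELPOV100 + rnorm(n, sd = 0.3)

fit_toy <- lm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100,
              data = toy)
coef(fit_toy)  # estimates of beta_0 through beta_4
```

Because the outcome is log-transformed, a coefficient \(\beta\) on an untransformed predictor is read approximately as a \(100 \cdot \beta\) percent change in median house value per one-unit change in that predictor.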
There are several assumptions associated with regression analysis that must be met for the results to be valid. These assumptions include linearity, independence of observations, homoscedasticity, normality of residuals, no multicollinearity, and no fewer than 10 observations per predictor.
First, Linearity assumes that the relationship between the dependent variable and the predictors is linear. To verify this assumption, we made scatter plots of the dependent variable against each predictor. If the relationship appears linear, the assumption was met.
Second, Independence of Observations assumes that the observations are independent of each other. There should be no spatial or temporal or other forms of dependence in the data.
Third, Homoscedasticity assumes that the variance of the residuals \(\epsilon\) is constant across all levels of the predictors. To check this assumption, we made a scatter plot of the standardized residuals against the predicted values. If the residuals are evenly spread around zero, the assumption was met; any pattern may indicate heteroscedasticity.
Fourth, Normality of Residuals assumes that the residuals are normally distributed. We examined the histogram of the standardized residuals to check if they are approximately normally distributed. If the histogram is bell-shaped, the assumption was met.
Fifth, No Multicollinearity assumes that the predictors are not highly correlated with each other. We calculated the correlation matrix of the predictors to check for multicollinearity. If no pairwise correlation coefficient is greater than 0.8 or less than -0.8, the assumption was met.
Finally, No Fewer than 10 Observations per Predictor assumes that there are at least 10 observations for each predictor in the model. Since there are over 1,700 observations in the dataset, this assumption was met.
After verifying the assumptions of the regression, we start the regression analysis. Several parameters need to be estimated here:
- \(\beta_0\), the intercept
- \(\beta_1, \dots, \beta_k\), the coefficients of the independent variables
- \(\sigma^2\), the variance of the error terms, which represents the variability in the dependent variable that is not explained by the predictors.
The least squares method estimates the coefficients by minimizing the sum of squared errors (SSE), which is the sum of the squared differences between the observed values and the predicted values. The formula for SSE is:
\[ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_k x_{ik})^2 \] where \(y_i\) is the actual or observed value, \(\hat{y}_i\) is the predicted value, \(\hat{\beta}_0, \dots, \hat{\beta}_k\) are the estimated coefficients, and \(x_{i1}, \dots, x_{ik}\) are the predictor values for the \(i\)-th observation.
With the Error Sum of Squares (SSE) calculated, we can estimate the variance of the error terms \(\sigma^2\) using the formula:
\[ \sigma^2 = \frac{\text{SSE}}{n - (k+1)} \] where \(n\) is the number of observations and \(k\) is the number of predictors in the model.
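The SSE and error-variance formulas above can be recovered from a fitted `lm` object. The sketch below uses simulated data (n = 100 observations, k = 2 predictors) rather than the Census dataset, and checks the hand computation against `summary.lm`'s residual standard error:

```r
set.seed(42)
d   <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = d)

sse    <- sum(residuals(fit)^2)  # SSE = sum of (y_i - y_hat_i)^2
sigma2 <- sse / (100 - (2 + 1))  # SSE / (n - (k + 1))

all.equal(sqrt(sigma2), summary(fit)$sigma)  # matches lm's residual std. error
```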
We evaluated the model’s fit using the coefficient of determination \(R^2\) and the adjusted \(R^2\). The coefficient of determination \(R^2\) measures the proportion of variance in the dependent variable that is explained by the predictors. The adjusted \(R^2\) corrects the \(R^2\) value for the number of predictors, providing a more accurate measure of fit for multiple regression, since adding predictors can artificially inflate \(R^2\).
To obtain \(R^2\), the total sum of squares \(SST\) must be calculated first. The \(SST\) measures the total variance in the dependent variable: \[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \] where \(y_i\) is the observed value and \(\bar{y}\) is the mean of the observed values. Then \(R^2\) is obtained as: \[ R^2 = 1 - \frac{SSE}{SST} \] Finally, \(R^2\) is adjusted based on the number of observations \(n\) and the number of predictors \(k\): \[ R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \]
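These fit measures can likewise be computed by hand and checked against `summary.lm`. The sketch below again uses simulated data (n = 200, k = 3 predictors), not the Census block groups:

```r
set.seed(7)
d   <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d$y <- 0.5 + d$x1 - 0.5 * d$x2 + 0.2 * d$x3 + rnorm(200)
fit <- lm(y ~ x1 + x2 + x3, data = d)

sse <- sum(residuals(fit)^2)       # error sum of squares
sst <- sum((d$y - mean(d$y))^2)    # total sum of squares

r2     <- 1 - sse / sst                           # R^2 = 1 - SSE/SST
r2_adj <- 1 - (1 - r2) * (200 - 1) / (200 - 3 - 1) # adjusted R^2

all.equal(r2, summary(fit)$r.squared)
all.equal(r2_adj, summary(fit)$adj.r.squared)
```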
In this analysis, several hypothesis tests are conducted to determine the significance of the model and its predictors. The overall significance of the model is assessed using the F-ratio, which compares the variance explained by the model to the variance not explained by the model. A higher F-ratio indicates that the model explains a significant amount of variance in the dependent variable relative to the residual variance.
The null hypothesis and alternative hypothesis for the F-ratio are stated as follows:
The null hypothesis \(H_0\) states that all coefficients are equal to zero, meaning that the predictors do not explain the variance in the dependent variable (median house value). Stated as:
\[ H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0 \]
The alternative hypothesis \(H_a\) states that at least one coefficient is not equal to zero. In our case, this means that at least one predictor explains the variance in the dependent variable (median house value). Stated as: \[ H_a: \text{At least one } \beta_i \neq 0 \]
We also conduct a t-test to determine the significance of each individual predictor in the model:
The null hypothesis \(H_{0i}\) states that the coefficient for predictor \(i\) is equal to zero, meaning that the predictor does not explain variance in the dependent variable (median house value). Stated as:
\[ H_{0i}: \beta_i = 0 \]
The alternative hypothesis \(H_{ai}\) states that the coefficient for predictor \(i\) is not equal to zero, meaning that the predictor does explain variance in the dependent variable. Stated as:
\[ H_{ai}: \beta_i \neq 0 \]
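Both tests are reported by `summary.lm`. The sketch below shows where the overall F-statistic and the per-coefficient t-statistics live in the summary object, again on simulated stand-in data:

```r
set.seed(3)
d   <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
d$y <- 2 + 0.8 * d$x1 + rnorm(150)
fit <- lm(y ~ x1 + x2, data = d)
s   <- summary(fit)

s$fstatistic                  # F value plus numerator and denominator df (overall test)
s$coefficients[, "t value"]   # t statistic for each H_0i: beta_i = 0
s$coefficients[, "Pr(>|t|)"]  # two-sided p-values for each coefficient
```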
In this analysis, we use the R programming language, a statistical computing environment widely used for data analysis and visualization. The following libraries and packages were employed:
- tidyverse: A collection of R packages for data manipulation and visualization.
- sf: A package for spatial data processing, especially for handling shapefiles.
- ggplot2: A popular visualization package for creating high-quality maps and charts, part of the tidyverse collection.
- ggcorrplot: A package designed to visualize correlation matrices using ggplot2.
- patchwork: A package for combining multiple ggplot2 plots into a single figure.
- MASS: A package providing functions and datasets for statistical analysis, including linear models.
- caret: A package for creating predictive models and conducting machine learning tasks, used here for cross-validation and k-fold analysis.
- kableExtra: A package for creating tables with advanced formatting options.

dependent_var <- "MEDHVAL"
predictors <- c("PCTBACHMOR", "NBELPOV100", "PCTVACANT", "PCTSINGLES")
summary_stats <- data %>%
dplyr::select(all_of(c(dependent_var, predictors))) %>%
summarise_all(list(Mean = mean, SD = sd), na.rm = TRUE) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
separate(Variable, into = c("Variable", "Stat"), sep = "_") %>%
pivot_wider(names_from = Stat, values_from = Value)
summary_stats$Variable <- recode(summary_stats$Variable,
"MEDHVAL" = "Median House Value",
"NBELPOV100" = "# Households Living in Poverty",
"PCTBACHMOR" = "% of Individuals with Bachelor’s Degrees or Higher",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
)
summary_stats <- summary_stats %>%
mutate(
Mean = round(Mean, 2),
SD = round(SD, 2)
)
summary_stats <- summary_stats %>%
arrange(Variable == "Median House Value")
predictor_rows <- which(summary_stats$Variable != "Median House Value")
dependent_rows <- which(summary_stats$Variable == "Median House Value")
# Determine the start and end rows for each group
start_pred <- min(predictor_rows)
end_pred <- max(predictor_rows)
start_dep <- min(dependent_rows)
end_dep <- max(dependent_rows)
# Create the table using kable and add extra formatting
kable(summary_stats, caption = "Summary Statistics",
align = c("l", "l", "l"), booktabs = TRUE, escape = FALSE ) %>%
add_header_above(c(" " = 1, "Statistics" = 2)) %>%
kable_styling(full_width = FALSE) %>%
group_rows("Predictors", start_pred, end_pred) %>%
group_rows("Dependent Variable", start_dep, end_dep)%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)

| Variable | Mean | SD |
|---|---|---|
| Predictors | ||
| % of Individuals with Bachelor’s Degrees or Higher | 16.08 | 17.77 |
| # Households Living in Poverty | 189.77 | 164.32 |
| % of Vacant Houses | 11.29 | 9.63 |
| % of Single House Units | 9.23 | 13.25 |
| Dependent Variable | ||
| Median House Value | 66287.73 | 60006.08 |
# check for zero values before applying log transformations
columns_to_check <- c(dependent_var, predictors)
zero_counts <- sapply(data[columns_to_check], function(x) sum(x == 0, na.rm = TRUE))
zero_counts[zero_counts > 0]

## PCTBACHMOR NBELPOV100  PCTVACANT PCTSINGLES
## 143 33 163 306
data <- data %>%
mutate(
LNMEDHVAL = log(MEDHVAL),
LNPCTBACHMOR = log(1+PCTBACHMOR),
LNNBELPOV100 = log(1+NBELPOV100),
LNPCTVACANT = log(1+PCTVACANT),
LNPCTSINGLES = log(1+PCTSINGLES)
)

longer_version <- data %>%
pivot_longer(cols = c("MEDHVAL", "PCTBACHMOR", "NBELPOV100", "PCTVACANT", "PCTSINGLES"),
names_to = "Variable",
values_to = "Value")
ggplot(longer_version,aes(x = Value)) +
geom_histogram(aes(y = ..count..), fill = "black", alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol = 3, labeller = as_labeller(c(
"MEDHVAL" = "Median House Value",
"PCTBACHMOR" = "% with Bachelor’s Degrees or Higher",
"NBELPOV100" = "# Households Living in Poverty",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
))) +
labs(x = "Value", y = "Count", title = "Histograms of Dependent and Predictor Variables") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

# histograms of the transformed variables
longer_version2 <- data %>%
pivot_longer(cols = c(LNMEDHVAL, LNPCTBACHMOR ,LNNBELPOV100,LNPCTVACANT, LNPCTSINGLES),
names_to = "Variable",
values_to = "Value")
ggplot(longer_version2,aes(x = Value)) +
geom_histogram(aes(y = ..count..), fill = "red", alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol = 3, labeller = as_labeller(c(
"LNMEDHVAL" = "Log Median House Value",
"LNPCTBACHMOR" = "Log % with Bachelor’s Degree",
"LNNBELPOV100" = "Log # Households in Poverty",
"LNPCTVACANT" = "Log % Vacant Houses",
"LNPCTSINGLES" = "Log % Single House Units"
))) +
labs(x = "Value", y = "Count", title = "Histograms of Dependent and log transformed Predictor Variables") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

ggplot(shape) +
geom_sf(aes(fill = LNMEDHVAL), color = "transparent") +
scale_fill_gradientn(colors = c("#fff0f3", "#a4133c"),
name = "LNMEDHVAL",
na.value = "transparent") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8)) +
labs(title = "Log Transformed Median House Value")

shpe_longer <- shape %>%
pivot_longer(cols = c("PCTVACANT", "PCTSINGLES", "PCTBACHMOR", "LNNBELPOV"),
names_to = "Variable",
values_to = "Value")
custom_titles <- c(
PCTVACANT = "Percent of Vacant Houses",
PCTSINGLES = "Percent of Single House Units",
PCTBACHMOR = "Percent of Bachelor's Degree or Higher",
LNNBELPOV = "Log-Transformed # of Households in Poverty"
)
plot_list <- lapply(unique(shpe_longer$Variable), function(var_name) {
data_subset <- subset(shpe_longer, Variable == var_name)
ggplot(data_subset) +
geom_sf(aes(fill = Value), color = "transparent") +
scale_fill_gradientn(
colors = c("#fff0f3", "#a4133c"),
name = var_name,
na.value = "transparent"
) +
labs(title = custom_titles[[var_name]]) +
theme(
legend.text = element_text(size = 8),
legend.title = element_text(size = 10),
legend.key.size = unit(0.3, "cm"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 15, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8)
)
})
# Combine the plots into a grid (2 columns by 2 rows)
combined_plot <- (plot_list[[1]] + plot_list[[2]]) /
(plot_list[[3]] + plot_list[[4]])
combined_plot

##
## Call:
## lm(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV100, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25825 -0.20391 0.03822 0.21744 2.24347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.1137661 0.0465330 238.836 < 2e-16 ***
## PCTVACANT -0.0191569 0.0009779 -19.590 < 2e-16 ***
## PCTSINGLES 0.0029769 0.0007032 4.234 2.42e-05 ***
## PCTBACHMOR 0.0209098 0.0005432 38.494 < 2e-16 ***
## LNNBELPOV100 -0.0789054 0.0084569 -9.330 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3665 on 1715 degrees of freedom
## Multiple R-squared: 0.6623, Adjusted R-squared: 0.6615
## F-statistic: 840.9 on 4 and 1715 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: LNMEDHVAL
## Df Sum Sq Mean Sq F value Pr(>F)
## PCTVACANT 1 180.392 180.392 1343.087 < 2.2e-16 ***
## PCTSINGLES 1 24.543 24.543 182.734 < 2.2e-16 ***
## PCTBACHMOR 1 235.118 235.118 1750.551 < 2.2e-16 ***
## LNNBELPOV100 1 11.692 11.692 87.054 < 2.2e-16 ***
## Residuals 1715 230.344 0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# fit the OLS model (the call matches the regression summary shown above)
fit <- lm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100, data = data)

fitted_values <- fitted(fit)
residuals_values <- residuals(fit)
standardized_residuals <- rstandard(fit)
data <- data %>%
mutate(
Fitted = fitted_values,
Residuals = residuals_values,
Standardized_Residuals = standardized_residuals)

ggplot(data, aes(x = Fitted, y = Standardized_Residuals)) +
geom_point(color = "black", size= 0.4) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(
title = "Scatter Plot of Standardized Residuals vs Fitted Values",
x = "Predicted Values",
y = "Standardized Residuals"
) +
theme_minimal() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

ggplot(data, aes(x = Standardized_Residuals)) +
geom_histogram(bins = 30, fill = "black") +
labs(title = "Histogram of Standardized Residuals",
x = "Standardized Residuals",
y = "Frequency") +
theme_minimal() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

longer <- data %>%
pivot_longer(cols = c("PCTBACHMOR", "LNNBELPOV100", "PCTVACANT", "PCTSINGLES"),
names_to = "Variable",
values_to = "Value")
ggplot(longer,aes(x = Value, y = LNMEDHVAL)) +
geom_point(color = "black", size= 0.4) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
facet_wrap(~ Variable, scales = "free", labeller = as_labeller(c(
"PCTBACHMOR" = "% with Bachelor’s Degrees or Higher",
"LNNBELPOV100" = "Logged Households Living in Poverty",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
))) +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8)) +
labs(title = "Scatter Plots of Dependent Variable vs. Predictors",
x = "Predictor Value",
y = "Log of Median House Value")

join <- data %>%
dplyr::select(POLY_ID, Standardized_Residuals)
shape <- shape %>%
left_join(join, by = c("POLY_ID" = "POLY_ID"))
ggplot(shape)+
geom_sf(aes(fill = Standardized_Residuals), color = "transparent") +
scale_fill_gradientn(colors = c("#fff0f3", "#a4133c"),
name = "Std Residuals",
na.value = "transparent") + # Choose a color palette, invert direction if needed
labs(title = "Choropleth Map of Standardized Residuals") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

custom_labels <- c(
"% of Individuals with Bachelor’s Degrees or Higher" = "PCTBACHMOR",
"% of Vacant Houses" = "PCTVACANT",
"% of Single House Units" = "PCTSINGLES",
"Log # Households Living in Poverty" = "LNNBELPOV100"
)
predictor_vars <- data[, c("PCTVACANT", "PCTSINGLES", "PCTBACHMOR", "LNNBELPOV100")]
cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")
print(cor_matrix)

## PCTVACANT PCTSINGLES PCTBACHMOR LNNBELPOV100
## PCTVACANT 1.0000000 -0.1513734 -0.2983580 0.2495470
## PCTSINGLES -0.1513734 1.0000000 0.1975461 -0.2905159
## PCTBACHMOR -0.2983580 0.1975461 1.0000000 -0.3197668
## LNNBELPOV100 0.2495470 -0.2905159 -0.3197668 1.0000000
rownames(cor_matrix) <- names(custom_labels)
colnames(cor_matrix) <- names(custom_labels)
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("#d73027", "white", "#1a9850"))+
labs(title = "Correlation Matrix for all Predictor Variables") +
theme(plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 7),
axis.text.y = element_text(size = 7),
axis.title = element_text(size = 8))

## Start: AIC=-3448.07
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
## Df Sum of Sq RSS AIC
## <none> 230.34 -3448.1
## - PCTSINGLES 1 2.407 232.75 -3432.2
## - LNNBELPOV100 1 11.692 242.04 -3364.9
## - PCTVACANT 1 51.546 281.89 -3102.7
## - PCTBACHMOR 1 199.020 429.36 -2379.0
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
## Final Model:
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1715 230.3435 -3448.073
# set up 5-fold cross-validation
lm <- trainControl(method = "cv", number = 5)
cvlm_model <- train(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100, data=data, method = "lm", trControl = lm)
print(cvlm_model)

## Linear Regression
##
## 1720 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1376, 1376, 1376, 1376, 1376
## Resampling results:
##
## RMSE Rsquared MAE
## 0.366227 0.6621857 0.2720142
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
cvlm_model_reduced <- train(LNMEDHVAL ~ PCTVACANT + MEDHHINC, data = data, method = "lm", trControl = lm)
print(cvlm_model_reduced)

## Linear Regression
##
## 1720 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1376, 1376, 1376, 1376, 1376
## Resampling results:
##
## RMSE Rsquared MAE
## 0.4418966 0.5087395 0.3178616
##
## Tuning parameter 'intercept' was held constant at a value of TRUE